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ABSTRACT 

Motivation: Accurate haplotyping— determining from which parent 
particular portions of the genome are inherited— is still mostly an un- 
resolved problem in genomics. This problem has only recently started 
to become tractable, thanks to the development of new long read 
sequencing technologies. Here, we introduce ProbHap, a haplotyping 
algorithm targeted at such technologies. The main algorithmic idea of 
ProbHap is a new dynamic programming algorithm that exactly opti- 
mizes a likelihood function specified by a probabilistic graphical model 
and which generalizes a popular objective called the minimum error 
correction. In addition to being accurate, ProbHap also provides con- 
fidence scores at phased positions. 

Results: On a standard benchmark dataset, ProbHap makes 11% 

fewer errors than current state-of-the-art methods. This accuracy 

can be further increased by excluding low-confidence positions, at 

the cost of a small drop in haplotype completeness. 

Availability: Our source code is freely available at: https://github.com/ 

kuleshov/ProbHap. 

Contact: kuleshov@stanford.edu 



I INTRODUCTION 

Although modern sequencing technology has led to rapid ad- 
vances in genomics over the past decade, it has largely been 
unable to resolve an important aspect of human genetics: gen- 
omic phase. Each human chromosome comes in two copies: one 
inherited from the mother, and one inherited from the father. 
Despite the fact that differences between these copies play an 
important biological role, until recently, decoding these differ- 
ences (a process known as haplotyping or genome phasing) has 
been a major technological challenge. 

In recent years, however, we have seen an emergence of new 
long read technologies (Kaper et al., 2013; Kitzman et al., 2010; 
Peter et al., 2012; Voskoboynik et al., 2013) that may one day 
enable routine cost-effective haplotyping. Because a long read 
comes from a single chromosome copy, it reveals the phase of 
all heterozygous genomic positions that it covers. By connecting 
long reads at their overlapping heterozygous positions, it is pos- 
sible to extend this phase information into haplotype blocks, in a 
process referred to as single-individual haplotyping (SIH) 
(Browning and Browning, 2011). 

Although from the molecular biology side, routine haplotyp- 
ing seems close to becoming a reality, dealing with long read data 
remains non-trivial computationally. Under most formulations 
of the problem, it is NP-hard to recover the optimal haplotypes 
from noisy sequencing reads (Gusfield, 2001). This has led to a 
vast literature on heuristics for dealing with this problem as ac- 
curately as possible. 

Here, we propose a new algorithm, ProbHap, which offers an 

I I % improvement in accuracy over the current leading method. 



RefHap. Unlike most other algorithms, ProbHap also provides 
confidence scores in addition to genomic phase. These scores can 
be used to prune low-accuracy positions and further improve 
haplotype quality, at the cost of phasing fewer variants. 

The main algorithmic ideas of ProbHap are a new dynamic 
programming algorithm and a probabilistic graphical model. 
The dynamic programming algorithm determines the haplotypes 
that maximize the likelihood function P(reads| true haplotypes) 
specified by the probabilistic model as well as the probability 
that these haplotypes are correct. It can be seen as a special 
case of the well-known variable elimination algorithm (Koller 
and Friedman, 2009). 

From a theoretical point of view, the likelihood function spe- 
cified by our probabilistic model generalizes a well-known ob- 
jective called the minimum error correction (MEC). Previously 
proposed exact dynamic programming algorithms for the MEC 
can be easily derived as special cases of the general variable elim- 
ination algorithm within our model. More interestingly, alterna- 
tive formulations of this algorithm (corresponding to different 
variable orderings) result in novel exact algorithms that are sig- 
nificantly faster than previous ones. Thus, our work generalizes 
several previous approaches and provides a systematic way of 
deriving new haplotyping algorithms. 



2 RELATED WORK 

Most phasing algorithms solve a formally defined computational 
problem called sih, in which the goal is to minimize an objective 
called the MEC (see Section 5). This objective is NP-hard 
(Gusfield, 2001); therefore, most early work on the sih problem 
involved simple greedy methods (Geraci, 2010). More recently, 
these methods have been superseded by more sophisticated heur- 
istics such as RefHap (Duitama et al., 2012) or HapCut (Bansal 
and Bafna, 2008) that involve solving a Max-Cut problem as a 
subroutine. There is also an exact dynamic programming solu- 
tion to the SIH problem; its running time is exponential in the 
length of the longest read (He et al., 2010). 

Several probabiHstic approaches have also been previously 
proposed, including HASH (Bansal et al., 2008), MixSIH 
(Matsumoto and Kiryu, 2013) and an algorithm used for recon- 
structing the diploid genome of Ciona intestinalis (Kim et al., 
2007). These methods optimize an objective function similar to 
that of ProbHap using heuristics based on Markov chain Monte 
Carlo (MCMC). They differ in the way in which they implement 
MCMC. In addition, MixSIH (Matsumoto and Kiryu, 2013) is 
to our knowledge the only package that also provides confidence 
scores at phased positions. 

Probabilistic graphical models are widely used in the statistical 
phasing literature to determine haplotypes from a panel of indi- 
viduals using linkage disequilibrium patterns. However, the vast 
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majority of statistical methods do not use the partial phase in- 
formation provided by long reads, and are not applicable to our 
setting. A notable exception is a recent method called Hap-Seq 
(He et aL, 2012); without its statistical component it reduces to 
the well-known exact exponential-time algorithm mentioned 
above (He et aL, 2010). 

Also, there exists an extensive Hterature on the SIH problem 
from the perspective of combinatorial optimization (Lippert 
et aL, 2002). Research in this field is aimed at optimizing com- 
binatorial objectives such as minimum fragment removal, min- 
imum SNP removal or MEC. This research is of a more 
theoretical nature and aims at providing a rigorous theoretical 
understanding of the SIH problem (Lippert et aL, 2002). 

3 RESULTS 

3.1 Overview of ProbHap 

ProbHap is based on a new exact dynamic programming solu- 
tion for the SIH problem, which makes it more accurate than 
many existing methods. Its main drawback is a higher computa- 
tional cost: its worst-case running time increases exponentially 
with the read coverage. Fortunately, modern long read technol- 
ogies cover the genome at a relatively low depth (Duitama et aL, 
2012; Kitzman et aL, 2010), making it possible to apply our al- 
gorithm to such data. In cases when the coverage is extremely 
high, ProbHap also uses a preprocessing heuristic to merge simi- 
lar reads (see Section 4). In our experience, ProbHap handles 
long read coverages of up to 20x; however, it is not appropriate 
for higher coverage short read datasets. 

The output of ProbHap is a set of haplotype blocks in the 
format of RefHap and HapCut. In addition, ProbHap also pro- 
duces at each position three confidence scores that can be used to 
identify locations where the phasing results are less accurate. The 
posterior score represents the probability of correctly determining 
the phase of a SNP with respect to the first SNP in the block. The 
transition score represents the probability of correctly determin- 
ing the phase of a SNP with respect to the previous one. Finally, 
the emission score is often helpful in finding sequencing errors 
and other issues with the underlying data. 

Whenever the transition score is too low, we suggest breaking 
the haplotype block at a position. Whenever the posterior or the 
emission scores are low, we suggest leaving that position 
unphased. 

3.2 Comparison methodology 

We compared ProbHap to three state-of-the art algorithms — 
RefHap (Duitama et aL, 2010), FastHare (Panconesi and 
Sozio, 2004) and DGS (Panconesi and Sozio, 2004) as well as 
to HapCut (Bansal and Bafna, 2008), a historically important 
phasing package, and to MixSIH (Matsumoto and Kiryu, 2013), 
the only method that we know that produces confidence scores. 
Previous studies (Duitama et aL, 2012; Geraci, 2010) have iden- 
tified the above methods as being the current state-of-the-art in 
single-individual haplotype phasing. 

Note that we do not compare our method to HapSeq (He 
et aL, 2012) because this package additionally uses population- 
based statistical phasing techniques to improve accuracy. We 
also do not consider previously proposed exact dynamic 



programming methods (He et aL, 2010), as they do not scale 
to long reads: their running time increases exponentially in the 
number of variants in a read, and some of the reads in our 
datasets have >50 variants. 

The heuristics we consider work as follows. In brief, FastHare 
sorts the input reads, and then traverses this ordering once, 
greedily assigning each read to its most probable chromosome 
given what has been seen so far. The DGS method is equally 
simple: it iterates until convergence between assigning each frag- 
ment to its closest chromosome, and recomputing a set of 
consensus haplotypes. The RefHap and Hapcut algorithms con- 
struct a graph based where each vertex is either associated with a 
position (HapCut) or with a sequencing read (RefHap); then, the 
algorithms approximately solve a MaxCut problem on this 
graph. 

We test the above methods on a long read dataset from 
HapMap sample NA 12878 that was produced using a fosmid- 
based technology (Duitama et aL, 2012). The long reads have an 
average length of '^40kb and cover the genome at a depth of 
^3 X . This dataset is a standard benchmark for SIH algorithms 
(Duitama et aL, 2012; Matsumoto and Kiryu, 2013) in part 
because HapMap sample NA 12878 has also been phased mul- 
tiple times based on the genomes of its parents. In this work, we 
take the trio-phased variant calls from the GATK resource 
bundle (DePristo et aL, 2011); these provide accurate phase at 
1 342091 heterozygous variants that are also present in the long 
read dataset. 

We measure performance using the concept of a switch error 
(Browning and Browning, 2011). A switch error is said to occur 
when the true parental provenance of SNPs on a haplotype 
changes with respect to the previous position. For example, if 
the true SNP origins of a phased block can be written as MMFF, 
then we say there is a switch error at the third position. In this 
analysis, we differentiate between two types of switches: a long 
switch corresponds to an inversion that lasts for more than one 
position (e.g. MMFF); a short switch, on the other hand, affects 
only a single position (e.g. MMFM). Switch accuracy is defined 
as the number of positions without switch errors, divided by the 
number of positions at which such errors could be measured. 
Long switch accuracy is defined accordingly in terms of long 
switch errors. We also measure accuracy in terms of switches 
per megabase (Sw./Mb). 

Finally, a block N50 length of x signifies that at least 50% of 
all phased SNPs were placed within blocks containing x SNPs or 
more. The percentage of SNPs phased was defined as the number 
of SNPs in blocks of length two or more, divided by the total 
number of SNPs. 

3.3 Results 

Given comparable phasing rates and N50 block lengths, 
ProbHap produced haplotype blocks with more accurate long- 
range phase: the long-range switch error of ProbHap was 1 1 % 
lower than that of the second best algorithm, RefHap (Table 1). 
In addition, ProbHap also produced 6% fewer short switch 
errors than RefHap. 

Note that long switch accuracy is substantially more import- 
ant than short switch accuracy, as it drastically changes the 
global structure of haplotypes. Short switch errors, on the 
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Table 1. Comparison of algorithm performance 



Accuracy/N50 tradeoff 



Algorithm 


Long sw./Mb 


Short sw./Mb 


% phased 


N50 


ProbHap 


1.07 


3.70 


91.83 


227 


Refhap 


1.20 


3.91 


91.75 


226 


FastHare 


1.32 


4.03 


91.76 


227 


DGS 


1.48 


4.18 


91.66 


227 


HapCut 


1.61 


4.93 


91.61 


227 


MixSIH 


1.41 


5.43 


92.64 


229 



Haplotype accuracy 




10"^ 10"^ 
Probability cutoff 



Fig. 1. Accuracy/completeness trade-off for ProbHap 



other hand, introduce relatively small amounts of noise in the 
data. 

3.4 Evaluating confidence scores 

In addition to being more accurate, ProbHap is also one of the 
few algorithms which can provide estimates of their accuracy in 
the form of confidence scores. As an example of how such scores 
might be used, we pruned phased positions that were deemed by 
ProbHap to be uncertain and measured the resulting accuracy. 

More specifically, we defined thresholds for each of the three 
confidence scores reported by ProbHap. Whenever the posterior 
or emission scores were lower than a threshold, we treated that 
position as unphased. Whenever the transition probability was 
below a threshold, we split the phased block into two parts at 
that position. 

Figure 1 shows that after pruning, one obtains phased blocks 
that are 30^0% more accurate than the unpruned blocks (recall 
that we describe them in Table 1); the price to pay is a drop of 
10-25% in N50 and phasing rate. The particular numbers shown 
in Figure 1 were achieved by fixing the posterior and transition 
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u 

I 98.6 
98.4 





— ProbHap 




— MixSIH 




200 250 300 350 

N50 

Accuracy/percent phased tradeoff 




0.90 0.91 0.92 0.93 
Percent pliased 

Fig. 2. Comparison of the accuracy/completeness trade-off of ProbHap 
and MixSIH. The top panel compares the trade-off between the N50 and 
the phasing accuracy; the phasing rate was the same for both algorithms 
at each point. Similarly, the bottom panel examines the phasing rate 
trade-off 



Table 2. Running time of each algorithm on chromosome 22 



Refhap 


FastHare 


DGS 


MixSIH 


ProbHap 


Running time 3.65 s 


1.85s 


1.99 s 


274.82 s 


58.53 s 



cutoffs to 0.6 and 10 ^, respectively, and setting the emission 
cutoff to 10-^ lO-\ 10-^ 10"^ 0.05, 0.1, 0.4 and 0.99. 

Next, we compared the pruned regions from ProbHap to those 
of MixSIH, the only other package that allows the user to ex- 
clude low-confidence positions. We chose thresholds so as to 
keep either the N50 or the phasing rate constant across both 
algorithms, and measured how accuracy varied with the remain- 
ing non- fixed parameter. We present the results of our experi- 
ment in Figure 2. 

Overall, we see that given the same level of haplotype com- 
pleteness, the pruned blocks of ProbHap contain 20-30% fewer 
switching errors than those from MixSIH. 

3.5 Running time 

We measured the running times of the algorithms on a laptop 
computer (Table 2). We did not include HapCut in this compari- 
son, as it is several orders of magnitude slower that the other 
methods (Duitama et al., 2012). Although the three heuristics ran 
faster than ProbHap and MixSIH, a major reason for their speed 
was due to not having to compute confidence scores. In fact, 
ProbHap spends roughly two- thirds of its running time 
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Table 3. Example of a 2 x 4 phasing matrix M, in which two reads cover 
three positions each 



where 





1 


2 


3 


4 


Read 1 


0 


1 


0 




Read 2 




1 


0 


0 



computing such scores. Nonetheless, it phases chromosome 22 in 
just under a minute; the total time for phasing a human genome 
was under 30 minutes. 



4 METHODS 

4.1 Notation 

Formally, an instance of the sih problem is defined by a pair of nxm 
matrices M, Q, whose columns correspond to heterozygous positions 
(indexed by 7= 1, ... , m), and whose rows correspond to reads (indexed 
by i= I, . . . , n). We refer to M as the phasing matrix; its entries take 
values in the set {0, 1, — }. These values indicate the allele carried by a 
read at a given position: for example. My = 0 signifies that read / covers 
position j and carries allele 0 at J. A value of - indicates that read / did not 
cover position J. See Table 3 for an example of a 2 x 4 phasing matrix. 

The nxm matrix Q e [0, 1]"^"^ is referred to as the q-score matrix; it 
encodes the probability of observing a sequencing error at a given pos- 
ition in a read. Such scores are available on virtually all sequencing 
platforms. 

A solution to an instance of the sih problem consists of a pair of 
vectors h e {0, 1}"' and r e {0, 1}". The former determines the subject's 
haplotypes: at each genomic position J, it specifies an allele hj e {0, 1}. We 
consider only one haplotype, as the second is always the complement h of 
the first. The second vector r e {0, 1}" indicates the true provenance 
r, G {0, 1} of each read i (i.e. whether i was obtained from the 'maternal' 
or the 'paternal' copy; because we do not have information to deter- 
mine which copy comes from which parent, we refer to them as 0, 1). 
We also use 



hjird- 



hj ifr/ = 0 
hj if Ki = 1 



to denote alleles on the haplotype from which read i originated. 

Next, let Fo(i) = \J\Mij / — } denote the set of positions covered by 
read /. Let also Hj = {/zy|min Po(z) <7 < max Po(0} be the set of haplo- 
type variables spanned by read / and let Rj = {ri\mmFo(i) < 
j < max Po(/)} be the set of read provenance variables spanning a pos- 
ition j. We will use this notation to simplify several expressions through- 
out the article. In particular, if position J is spanned by, say, reads 2, 3, 
then we will use the notation maxj?. /(i?y) = max,^2,r3 /(^2, ^3) and 
Ei.,/(^;) = E.„.3/(^2,r3). 



4.2 Probabilistic model 

We define the probability P(r, h, o) over haplotypes h e {0,1} 
ments of reads r e {0, 1}" and observed data o e {0,1, — }"^' 
product of factors 



P(r,h,o) = Y\ Yl P(Oij\r^,hj)Y\P(rdY\P(hj), 

z=1./:/gPo(0 '=1 .;=1 



, assign- 
to be a 



P(0ij\rt,hj)-- 



Qij if Oij 7^ hj{r,) 
1 - Qij if Oij = hj{ri) 



is the probability of observing the allele on the y-th position in read /, and 
the factors P(ri) and P(hj) are priors that we leave as uniform, except for 
P(h\ =0)= 1. This last choice eliminates the ambiguity stemming from 
the fact that a solution h can be always replaced with its complement h; it 
resolves this ambiguity by always choosing the solution with h\=0. 
Finally, note that the r and h variables are hidden, while the o variables 
are observed; the observed values are defined by the matrix M. 

The dependency structure of P can be represented in terms of a 
Bayesian network whose topology mirrors the two-dimensional structure 
of the matrix M. See Figure 3 for the Bayesian network associated with 
the phasing matrix in Table 3, which we gave earlier as an example. 

4.3 Maximum likelihood haplotypes 

We determine maximum-likelihood haplotypes h* = arg max/j log P(o = 
M\h) using the belief propagation algorithm, also known as max-sum 
message passing over a junction tree (Koller and Friedman, 2009). In 
brief, this algorithm involves groups of variables passing each other in- 
formation about their most likely assignment; a well-known special case 
of this method is the Viterbi algorithm for hidden Markov models 
(HMMs). 

4.3.1 Definition of max-sum message passing We start by briefly 
defining the max-sum message passing algorithm for graphical models. 
Readers familiar with the subject may skip this subsection. 

Definition 1. Let P be a probability over a set of variables 
X= {x\, . . . , Xn} that is a product of k factors 

each factor 0,- being defined over a subset of variables Xi ^ X. A Junction 
tree T over P is a tree whose set of nodes is a family of subsets 
C={Ci, . . . , Cm], with Cj c X and that satisfies the following properties: 

(1) For each factor there is a cluster c(i) such that Xi c Cc(/). 

(2) (Running intersection) Ifxe Q and x e Cj, then x e Ck for all C^ 
on the unique path from Ci to Cj in T. 

Given this definition, we now define max-sum message passing. We 
restrict our definition to the case when the junction tree 7" is a path, which 
is going to be the case for our model. 

Definition 2. Let P be a probability distribution as in Definition L Let T 
be a junction tree over clusters Cjfor j=l, . . . ,m connected into a path and 
ordered by J, with Cm serving as the root. The max-sum message from Cj to 
Cj+ 1 is a function Mj defined over the variables in Cj H C/+ 1 as 

Mj{Cjf^Cj+l)^ max ( ^ log (/>,(X0 + My_i(q_i n q) ) , 

\/:c(0=/ / 

with the additional definition that Mq = 0. 

The max-sum message passing algorithm recursively computes the 
above messages and determines that max is 



max 



( J2 iog0.«) + M,,_i(c^_inc„o). 



The actual assignment that maximizes P can be found by storing the 
variable assignments that maximize each Mj. Unfortunately, proving 
the correctness of this algorithm is beyond the scope of this article. For 
a complete discussion that holds for arbitrary junction trees, we refer the 
reader to a textbook on graphical models (Koller and Friedman, 2009). 
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Fig. 3. Bayesian network associated with the problem instance defined in 
Table 3. The shaded nodes represent hidden variables; unshaded variables 
are observed. Variables belonging to cluster C3 of the associated junction 
tree are shown in bold 

4.3.2 Applying max-sum message passing to the ProbHap 
model We now define how the max-sum message passing algorithm 
is applied to the graphical model we defined in Section 4.2. 

Definition 3. Let T be a junction tree for P defined by clusters 

Cj = [fi, hj, Oylmin Po(/) < 7 < max Po(z)} 

for j=l, . . . ,m connected into a path ordered by j, with serving as the 
root. 

Each cluster Cj contains hj and all the Ojj and r,- variables associated 
with reads that span across position j. For an example of one such cluster, 
see Figure 3. 

Lemma 1. The tree T in Definition 3 is a valid junction tree for the distri- 
bution P defined in Section 4.2. 

Proof. It is easy to check that the scope of each factor of P is in a unique 
cluster. We therefore focus on proving that Thas the running intersection 
property. 

Let Cx, Cy be two clusters in T with x <y, and let Q be a cluster on 
the path between and Cy. Because T is a path, we must have 
X Sy- We need to show that Cy fi ^ Q. 

Observe that by construction Cy fi C^^ can only contain r-variables. Let 
ri e CyP^ Cx be one such variable. We need to show that r/ g Q, i.e. that 
min Po(/) < z < max Po(/). 

From ri e Cy D Cx, we have that Po(/) < y < x < max Po(/). Because 
we also have x < z < y, our claim follows. ■ 

Now let Rjnj+\= Rj Rj+i and Rj\j+i= Rj\Rj+\. The interested 
reader may verify that the message from cluster j to cluster 7 + 1 
during a run of max-sum message passing with C^ as the root of T 
equals for7> 1, 

Mj{Rjnj+\) = max max V log |r/, hj) + Mj_i(Rj_inj) , (1) 

and for 7= 1, Mi(Rin2) = T0(\ax log P(on\ri, hi =0)^. Note 

that we disregard the priors P(ri), PQij) in all messages except the first 
because they are uniform. 

Intuitively, M(i?yn/+i) represents the maximum likelihood of the data 
at positions 1, ... ,7 assuming that reads spanning both 7 and 7 + 1 have 
provenances specified by Rj^j+i- The maximum of P is computed using 
the recursion 

max mjix I ^ log P(of^|rf, /z^) + M^_i(7^^_inm) I • 



Table 4. Example of a sequencing error that confounds the long-range 
structure of the haplotypes 





1 


2 


3 


4 


5 


Read 1 


0 


0 


1 


0 




Read 2 






0 


0 


0 



Note. If the quality scores are the same at all positions, the haplotypes h = 00000, 
h = 00111 have the same probability. 



4.3.3 Running time The above algorithm computes one message for 
each of m. A message specifies a value for each assignment of variables 
in Rjnj+i', this value is the maximum over all assignments to hj and to 
Rj\j+i, and for each such assignment, we need to compute J^i-neR- ^ 
(oij\rj, hj) in 0(\Rj\) time. Therefore, computing a message requires \Rj\ 
x2x2i^'^'+^l X 2l^A'+'l = |i?y|2l^^l + ^ iterations. Thus, the total running 
time of the algorithm is 0(m/c2'^^^), where K = m2iXj\Rj\ is the maximal 
coverage across all the positions. 

4.4 Confidence scores 

Next, we turn our attention to deriving confidence estimates for genomic 
regions. As an example of why such estimates are useful, we show in 
Table 4 that, somewhat counter-intuitively, two SNPs may be unphased 
even when they are connected by accurate reads. 

4.4.1 Motivating example In Table 4, the data contains sequencing 
errors at position 3 or 4. If the error occurs at position 3 (in either row), 
then the two reads come from the same haplotype and the correct solu- 
tion is h = 00000. If, on the other hand, the error occurs at position 4, 
then the two reads come from different chromosomes and the true haplo- 
type is h = 00111. If the quality scores are the same at all positions, the 
four errors are equally likely, and the haplotypes h = 00000, h = 00111 
have the same probabiHty. 

Simple optimization-based algorithms would Hkely produce a single 
haplotype in the above example; our probabiHstic model, however, would 
assign a transition probability of 0.5 to position 3. 

4.4.2 Dynamic programming recursion We again perform probabil- 
istic inference in our model using belief propagation. Our particular im- 
plementation of this method is inspired by the sum-product message 
passing algorithm (Koller and Friedman, 2009) over the previously 
defined junction tree T. In sum-product message passing, clusters of vari- 
ables pass to each other information about their local probabihty distri- 
bution; after two rounds of message passing (referred to as 'forwards' and 
'backwards'), the clusters become calibrated and can be queried for vari- 
ous probabilities. A well-known special case of this method is the for- 
wards-backwards algorithm for HMMs. 

More concretely, we compute for each node 7 two factors, F[hj, Rj] and 
B[hj, Rj], using the dynamic programming recursions below. 

Flhj,Rj] 

hj-i Rj-i-R, 
B[hj, Rj] 

= E E B[hj+iRj+i]P(Oj+i\hj+i,Rj+0P(Pj+i)P(hj+i) 

hj+x Rj+\-Rj 

The notation Rj^Rj_\ indicates that the r,- variables common to both Rj 
and Rj-i have been assigned the same value, and P(Oj\hj, Rj) is short- 
hand for Or/vGi? P{oij\ri, hj). It follows from our definition of the prior 
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PQii) that the initial values equal F\hi=0, Rx\ = P{Ri) and 
F\hi = 1, = 0; in addition, = 1. 

It is easy to show by induction that 



F\hj,Rj\ = P{0,.,j,hj,Rj) 

B[hj,Rj] = P{Oj+i.,Jhj,Rj), 
where Ok:i = {Oij\k <j< /}. 



(4) 
(5) 



4.4.3 Computing confidence probabilities From (4), (5), we can 
now easily compute confidence scores. One such score is the posterior 
probability P(hj\0\:m). It represents the probability that hj was deter- 
mined correctly with respect to hi and can be computed as 

P(hj\Oi,,,) = Ei?^. P(hj, Rj\oi,,,\ where 

PQij, Rj\0i,,,) = P(0i.j, hj, Rj)PiOj+i:„\hj, Rj)/PiOi,n). 

Next, the transition probability P(hj\hj-\, Oi:m) represents the prob- 
ability of consecutive SNPs being phased correctly; it can be used to 
detect potential errors Hke the one shown in Table 4. We compute this 
value using the identity P(hj\hj-i, Oi:m) = P(hj, hj-i\Oi:,n)/P(hj-i\Oi:m), 
where the denominator is the posterior probability and the numerator 
is computed as 



Pihj,hj_l\Oi,n) = 



ER,,R,_,P(hj,hM,Rj,Rj-uOi._^) 



where 



_ J:h,;R,- P(Oj+i:m\hj, Rj)T(hj, Rj, Oj) 
Tihj, Rj, Oj) = J2 ^W^P Rj)Pihj)P(Rj)PiOi,, hj, Rj) 



Additionally, we found that the emission probability P{Oj\hjRj) was 
useful in detecting errors in the data. Computing this value only involves 
the expression P{Oj\hjRj) = UrjePoH) Pioijln, hj). 

Finally, note that in general, one can compute any set of probabilities 
P(hk\hi, Oi-m) in the model. However, this involves doing potentially up 
to a full run of message passing. 



4.5 A merging heuristic 

The exact dynamic programming algorithm described above is practical 
for coverages of up to 10-12x. For deeper or for highly uneven cover- 
ages, we propose a simple preprocessing heuristic. The heuristic consists 
in reducing the coverage by repeatedly merging reads that are likely to 
come from the same haplotypes until there are no reads that we can 
confidently merge. 

To determine whether to merge reads k, /, we consider the ratio 



{Piokj, 0, 0)Pioij, 1,0) + P(okj, 1, 0)Pioij, 0, 0)) 
{P{oki, 0, 0)P(o,,, 0,0) + P(ot,, 1 , 0)P(o,j, 1,0))' 



(6) 



where P(okj,x,y) is shorthand for P(oiJ, rk = x, hj= y). Intuitively, the 
denominator is associated with the likelihood that the two reads come 
from the same haplotype and the numerator is associated with the Hke- 
lihood that the reads' origins are different. Both terms are estimated by a 
heuristic formula that decomposes over each position. If reads k, I are 
merged, then position j of the resulting new read is assigned the allele that 
has the highest q-score in the initial reads k, I (i.e. arg maxkjiQkj, Qij})', 
the q-score at that position is set to the difference of the initial reads' 
q-scores (i.e. \Qkj - Qij\)- 

In practice, one may select a confidence threshold for (6) and only 
merge reads that are below this threshold. We found empirically a 
value of 1 — 10~^ to work well. 



4.6 A post-processing heuristic 

In addition, ProbHap admits an extra post-processing heuristic for ad- 
justing the optimal haplotypes h*. This heuristic was initially proposed for 
the algorithm RefHap; ProbHap currently uses it by default, although it 
can be disabled. The heuristic starts with the optimal read assignments r* 
and determines at each position J a pair of sets 

^,-,0 = {i\(ri = 0 n My = 0) U (n = 1 n My = 1)} 

Sj-i = {i\(n = 0 n My = 1) u (n = i n My = O)}. 

It then outputs a new haplotype /?"^^ defined as 

0 if |5y,ol>l^/,ll 

1 if |5,,o|<|^;,ll 

— otherwise. 



/,new . 

7 



We found that this heuristic increases the short switch accuracy of 
ProbHap on the NA12878 dataset; the long switch accuracy remains 
the same. We suggest using this heuristic in settings where the quality 
scores may not be well calibrated. 



5 DISCUSSION: THEORETICAL ASPECTS 

Interestingly, the probabihstic framework of ProbHap general- 
izes the siH formalism on which most existing methods are based. 
This allows us to easily derive well-known exact dynamic pro- 
gramming algorithms as special cases of the variable elimination 
algorithm for graphical models. More interestingly, the variable 
elimination algorithm with different variable orderings results in 
novel exact algorithms that are far more efficient than existing 
ones. 



5.1 Generalizing the sih framework 

In its standard formulation, the sm problem consists in finding a 
haplotype h that minimizes the MEC criterion: 

MEC(/z, M) 



J:jeFo(i) fjeFo(i) 



where / : {True, False} ^ {0, 1} is the indicator function, and the 
remaining notation is the same as defined in the Section 4. The 
MEC measures the total number of positions within all the reads 
that need to be corrected to make the reads consistent with a 
haplotype h. 

It is easy to show that the MEC objective can be recovered as a 
special case of our framework. Indeed, if we define the factors 
(p(oij, f/, hj) (which we have previously set to P(oij\ri, hj)) in a way 
that 



(t){0ij, ri, hj) -- 



exp (1) if Oij ^ hjin) 
exp (0) if Oij = hj(ri), 



then log P(M, r, h) equals MEC(/z, M), although P is no longer a 
probabihty. 

Thus, our dynamic programming algorithms can also produce 
exact solutions to the MEC objective, and just as interestingly, 
they can produce confidence probabilities associated with the 
MEC. 
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5.2 Rederiving existing sm algorithms 

Interestingly, we can easily recover an existing dynamic program- 
ming algorithm (He et al., 2010) for the MEC as a special case 
of variable elimination in our graphical model. Indeed, con- 
sider the junction tree defined by n variable clusters 
C/ = {r/, hj, Oij\j e Po(0} connected into a path ordered by /. If 
we assume for simplicity that the data have no contained reads, 
then the message from cluster i-\ to cluster / during a run of 
max-sum message passing with C„ as the junction tree root 
equals precisely 

M(Hini+ 1) = max max Y] log P(oij\ri, hj) + M(i7/_in/) , 

(7) 

where i//n/+ i^Hid Hi+ 1 and 1 = Hi\Hi+ 1 . This is essen- 
tially the well-known dynamic programming recursion (He et al., 
2010) we were looking to find. 

Unfortunately, the time to compute the above recursion in- 
creases exponentially in the length of the reads, which is precisely 
the data we want to use for phasing. 

5.3 Deriving novel sm algorithms 

Fortunately, as we have seen, we can derive from our framework 
exact algorithms that are suitable for long read data. 
Interestingly, these methods are in a sense dual to equation (7): 
the structure of the probabilistic model P is entirely symmetric in 
r, h. If we reverse h and r in Section 4, we obtain recursion (7). 

Potentially, our framework allows deriving other exact algo- 
rithms by defining alternative junction trees for the max-sum 
message passing algorithm. One way to do this involves using 
minimizing their tree-width using some well-known heuristics 
(Koller and Friedman, 2009). Because the running time max- 
sum message passing is exponential in the tree-width of a junc- 
tion tree, this would lead to much faster running times. 

6 CONCLUSION 

In summary, we have introduced a new single-individual phasing 
algorithm, ProbHap, that offers an 1 1 % improvement in accur- 
acy over the current state-of-the-art method, RefHap. In add- 
ition, it is one of the only methods to provide the user with 
confidence scores at every position; these confidence scores can 
be used to prune positions whose phase is uncertain and thus 
substantially increase the overall accuracy. 

The advances behind ProbHap are made possible by framing 
the phasing problem within a probabilistic graphical models 
framework. This framework makes it particularly easy to 
reason about the problem; in fact, all our algorithms are special 
cases of standard procedures for optimizing graphical models. 

On the theoretical side, this work generalizes the MEC criter- 
ion used by existing methods. Our approach allows us to obtain 
existing algorithms as special cases of well-known optimization 



procedures, and also easily derive new, more efficient algorithms; 
it may thus serve as a foundation for further algorithmic insights. 
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